Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: introduce/fix TestPreVoteWithCheckQuorum #8334

Conversation

lishuai87
Copy link
Contributor

@lishuai87 lishuai87 commented Jul 29, 2017

raft: set leader id to none when becomePreCandidate.

When a follower election timeout, it should mark leader to none (so that old leader is not in lease). Current implementation of Pre-Vote ignore this, which may cause wrong judgement .

@xiang90
Copy link
Contributor

xiang90 commented Jul 29, 2017

@javaforfun can you add a test case?

@lishuai87 lishuai87 force-pushed the shawnsli/reset-leader-when-become-pre-candidate branch from fa9c6f7 to bf883bc Compare July 30, 2017 09:09
@lishuai87
Copy link
Contributor Author

@xiang90 done. add two test

@lishuai87 lishuai87 changed the title set lead = None when becomePreCandidate raft: introduce/fix TestPreVoteWithCheckQuorum Jul 30, 2017
@lishuai87
Copy link
Contributor Author

lishuai87 commented Jul 31, 2017

seems confusing when raft recv a message with higher term, i will create a new pr for this

@lishuai87 lishuai87 force-pushed the shawnsli/reset-leader-when-become-pre-candidate branch from bf883bc to 22fb6a3 Compare July 31, 2017 17:19

sm := nt.peers[1].(*raft)
if sm.state != StateLeader {
t.Errorf("peer 1 state: %s, want %s", sm.state, StateLeader)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use t.fatalf for checking the pre-condition. use t.errorf for the actual state checking.

t.Errorf("peer 3 state: %s, want %s", sm.state, StatePreCandidate)
}

sm.logger.Infof("going to bring back peer 3 and kill peer 1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the logging.

sm2 := nt.peers[2].(*raft)
sm3 := nt.peers[3].(*raft)
if sm2.state != StateLeader && sm3.state != StateLeader {
t.Errorf("no leader")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errorf needs better context. see other tests's errorf output as examples

@xiang90
Copy link
Contributor

xiang90 commented Aug 1, 2017

@bdarnell

can you take a quick look too?

raft/raft.go Outdated
@@ -611,6 +611,7 @@ func (r *raft) becomePreCandidate() {
// but doesn't change anything else. In particular it does not increase
// r.Term or change r.Vote.
r.step = stepCandidate
r.lead = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is strange to change r.lead here when we do not change r.Term. r.lead is still the leader of that term. I think we should leave r.lead unchanged here, and if we change anything, we should change the inLease check at line 711. We can't become a pre-candidate if we've heard from the leader within the lease timeout, so if state == StatePreCandidate inLease must be false.

Copy link
Contributor Author

@lishuai87 lishuai87 Aug 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But some users may use hasLeader() to check whether cluster is in normal state. It's too tricky to let users know StatePreCandidate or inLease.

Also in raft paper, it just say:

If a follower receives no communication over a period of time called the election timeout,
then it assumes there is no viable leader and begins an election to choose a new leader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@javaforfun

it assumes there is no viable leader and begins an election to choose a new leader.

begins election means increasing term, right? The pre candidate phase seems like a grey area. But I prefer to keep the leader field to be not changed as @bdarnell suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiang90 I think thesis section 6.2 point out how to handle this situation. see #8334 (comment)

@lishuai87
Copy link
Contributor Author

I've updated this pr, please take a look. @xiang90 @bdarnell

@lishuai87
Copy link
Contributor Author

I've found another interesting scenario for Pre-Vote together with checkQuorum.

  1. cluster have 5 nodes [A, B, C, D, E], with leader A, all terms are 1.
  2. network partition happen, cluster have two partition [A, B], [C, D, E].
    so [A, B] has leader A with term 1, [C, D, E] has leader C with term 2.
  3. A will checkQuorum after election timeout, and become follower, set leader to None with term 1.
    B start pre election, but B has leader = A with term 1.

it's strange that one node think another node is leader, but the other node is not leader, and they are in same partition!
so if leader can become follower, and change r.lead to none with term not changed, follower also can do this.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to use both CheckQuorum and PreVote together? In CockroachDB we were using them for the same purpose so we turned off CheckQuorum when we enabled PreVote.

I think that only part of CheckQuorum's functionality is needed when PreVote is enabled: you want the leader to step down, but you don't need to change the way votes are granted (I think). So what if we changed

https://github.com/coreos/etcd/blob/4ec31f4f7f95b8ffbacc7a3d0f7173a44b1f11e8/raft/raft.go#L712

to if !force && inLease && !r.preVote?

raft/raft.go Outdated
if r.checkQuorum {
// If network partition happen, and leader is in minority partition,
// it will step down, become follower without notifying others.
r.lead = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this change matter?

It doesn't seem that strange to me that other nodes would still consider node A to be the leader when it has stepped down due to CheckQuorum.

Copy link
Contributor Author

@lishuai87 lishuai87 Aug 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described in thesis section 6.2, Routing requests to the leader,

Raft must also prevent stale leadership information from delaying client requests indefinitely.
Leadership information can become stale all across the system, in leaders, followers, and clients:

• Leaders: A server might be in the leader state, but if it isn’t the current leader, it could be
needlessly delaying client requests. For example, suppose a leader is partitioned from the
rest of the cluster, but it can still communicate with a particular client. Without additional
mechanism, it could delay a request from that client forever, being unable to replicate a log
entry to any other servers. Meanwhile, there might be another leader of a newer term that is
able to communicate with a majority of the cluster and would be able to commit the client’s
request. Thus, a leader in Raft steps down if an election timeout elapses without a successful
round of heartbeats to a majority of its cluster; this allows clients to retry their requests with
another server.

• Followers: Followers keep track of the leader’s identity so that they can redirect or proxy
clients. They must discard this information when starting a new election or when the term
changes. Otherwise, they might needlessly delay clients (for example, it would be possible
for two servers to redirect to each other, placing clients in an infinite loop).

followers need discard leader information when starting a new election or when the term changes.

if not, in the previous scenario, partition [A, B] has A with lead = None, B with lead = A,
if client request servers with alphabetical order, it could delay forever.

@lishuai87
Copy link
Contributor Author

CheckQuorum can prevent disruptive servers, which described in thesis section 4.2.3.

@es-chow
Copy link
Contributor

es-chow commented Aug 30, 2017

Refer to raft thesis 9.6 Preventing disruptions when a server rejoins the cluster.
it seems Pre-Vote is good for network partition or jitter. For example.

  1. Assume cluster have 3 nodes [A, B, C], the network between C and A,B is not stable, the link will jitter every hour.
  2. If without Pre-vote, node C will election periodically to make the cluster leader step down. But if with Pre-Vote, node A/B will work as node C cannot increase term to step down cluster leader.

checkQuorum seems good for disruptive server which had been removed from the cluster, but it don't resolve the issue of network jitter.

So it's good to enable both checkQuorum and Pre-vote.

Hoverbear pushed a commit to tikv/raft-rs that referenced this pull request Jul 2, 2018
We discovered some instances where prevote and check quorum interact
poorly. This commit introduces a test showcasing this.

Related to etcd-io/etcd#8334.
Hoverbear pushed a commit to tikv/raft-rs that referenced this pull request Jul 2, 2018
We discovered some instances where prevote and check quorum interact
poorly. This commit introduces a test showcasing this.

Related to etcd-io/etcd#8334.
Hoverbear pushed a commit to tikv/raft-rs that referenced this pull request Jul 3, 2018
We discovered some instances where prevote and check quorum interact
poorly. This commit introduces a test showcasing this.

Related to etcd-io/etcd#8334.
Hoverbear pushed a commit to tikv/raft-rs that referenced this pull request Jul 4, 2018
We discovered some instances where prevote and check quorum interact
poorly. This commit introduces a test showcasing this.

Related to etcd-io/etcd#8334.
Hoverbear pushed a commit to tikv/raft-rs that referenced this pull request Jul 5, 2018
* tests: Introduce prevote with check quorum test.

We discovered some instances where prevote and check quorum interact
poorly. This commit introduces a test showcasing this.

Related to etcd-io/etcd#8334.

* raft: Introduce prevote/check_quorum compatability fix.

* raft: Always reset leader in precandidate
@robin865
Copy link

robin865 commented Nov 9, 2018

So is this ever going to get merged? We've hit this issue in production, where with checkQuorum and preVote activated we are in an endless loop of ignoring MsgPreVotes. 3 nodes. Leader goes down. Other two endlessly try to PreVote and keep ignoring each other. The randomization in "pastElectionTimeout" which I think is supposed to prevent this is not working.

@lishuai87
Copy link
Contributor Author

lishuai87 commented Nov 21, 2018

So is this ever going to get merged? We've hit this issue in production, where with checkQuorum and preVote activated we are in an endless loop of ignoring MsgPreVotes. 3 nodes. Leader goes down. Other two endlessly try to PreVote and keep ignoring each other. The randomization in "pastElectionTimeout" which I think is supposed to prevent this is not working.

or maybe this: #8346

@robin865
Copy link

robin865 commented Nov 21, 2018 via email

@lishuai87
Copy link
Contributor Author

@xiang90 PTAL

raft/raft.go Outdated
@@ -611,6 +611,11 @@ func (r *raft) becomePreCandidate() {
// but doesn't change anything else. In particular it does not increase
// r.Term or change r.Vote.
r.step = stepCandidate
if r.checkQuorum {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be based on r.checkQuorum. That implies that we are doing this because we know that the leader is failing its quorum check, but we can't know that. The leader could be in the majority side of the partition and still be acting as leader.

#9586 discussed setting r.lead = None in becomePreCandidate whether checkQuorum is true or not. I think that could work here by keeping things from getting permanently stuck, but it has the odd effect that the second (pre-)campaigner wins. For that reason I still think it would be better to modify the definition of inLease at line 711 by adding && r.state == StateFollower.

On the other hand, that's a more complex and invasive change, and I think this version is simple and safe. It makes sense that pre-candidates act like they don't know the leader (same as candidates). I was previously concerned that it would be unusual for a node to transition between r.lead set and unset within a single term, but leaders already do that for checkQuorum so it should be fine.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, but I think it would be better to set r.lead = None in becomeCandidate unconditionally, instead of only when r.checkQuorum is enabled.

There's also a jenkins failure that I can't see (any more?); maybe it will go away if the branch is updated.

@lishuai87 lishuai87 force-pushed the shawnsli/reset-leader-when-become-pre-candidate branch 2 times, most recently from f35ab44 to ae62732 Compare December 28, 2018 04:14
@codecov-io
Copy link

codecov-io commented Dec 28, 2018

Codecov Report

Merging #8334 into master will decrease coverage by 0.07%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8334      +/-   ##
==========================================
- Coverage   71.84%   71.77%   -0.08%     
==========================================
  Files         391      391              
  Lines       36417    36418       +1     
==========================================
- Hits        26164    26139      -25     
- Misses       8438     8464      +26     
  Partials     1815     1815
Impacted Files Coverage Δ
raft/raft.go 92.16% <100%> (+0.01%) ⬆️
proxy/grpcproxy/register.go 69.44% <0%> (-13.89%) ⬇️
etcdctl/ctlv3/command/lease_command.go 65.34% <0%> (-5.95%) ⬇️
etcdserver/v2_server.go 80.76% <0%> (-3.85%) ⬇️
etcdserver/api/v3rpc/lease.go 69.31% <0%> (-2.28%) ⬇️
raft/progress.go 94.17% <0%> (-1.95%) ⬇️
lease/leasehttp/http.go 63.97% <0%> (-1.48%) ⬇️
etcdserver/api/v3rpc/watch.go 82.35% <0%> (-1.31%) ⬇️
etcdserver/api/rafthttp/peer.go 78.77% <0%> (-1.12%) ⬇️
etcdserver/cluster_util.go 58.74% <0%> (-0.9%) ⬇️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6937b77...23731bf. Read the comment docs.

@lishuai87 lishuai87 force-pushed the shawnsli/reset-leader-when-become-pre-candidate branch from ae62732 to f720bc4 Compare December 28, 2018 11:41
@lishuai87 lishuai87 force-pushed the shawnsli/reset-leader-when-become-pre-candidate branch from f720bc4 to 23731bf Compare December 28, 2018 11:59
@lishuai87
Copy link
Contributor Author

@bdarnell I've updated the test case.

@xiang90
Copy link
Contributor

xiang90 commented Dec 28, 2018

lgtm

@xiang90 xiang90 merged commit cc8d446 into etcd-io:master Dec 28, 2018
@lishuai87 lishuai87 deleted the shawnsli/reset-leader-when-become-pre-candidate branch December 29, 2018 11:36
@jkessler93
Copy link

@lishuai87 Would it be possible for you to release a new version with this fix? I am currently experiencing this bug.

@lishuai87
Copy link
Contributor Author

@lishuai87 Would it be possible for you to release a new version with this fix? I am currently experiencing this bug.

@xiang90 @gyuho PTAL

@brk0v
Copy link

brk0v commented Jan 18, 2019

+1. Can reproduce this with my jepsen test as well.

sbenderli pushed a commit to sbenderli/etcd that referenced this pull request Mar 21, 2019
gyuho added a commit that referenced this pull request Jul 23, 2019
hexfusion pushed a commit to openshift/etcd that referenced this pull request Sep 26, 2019
shbieng pushed a commit to shbieng/raft-rs that referenced this pull request Sep 19, 2022
* tests: Introduce prevote with check quorum test.

We discovered some instances where prevote and check quorum interact
poorly. This commit introduces a test showcasing this.

Related to etcd-io/etcd#8334.

* raft: Introduce prevote/check_quorum compatability fix.

* raft: Always reset leader in precandidate
siennathesane pushed a commit to shieldmaidens/pleiades that referenced this pull request Nov 4, 2023
* tests: Introduce prevote with check quorum test.

We discovered some instances where prevote and check quorum interact
poorly. This commit introduces a test showcasing this.

Related to etcd-io/etcd#8334.

* raft: Introduce prevote/check_quorum compatability fix.

* raft: Always reset leader in precandidate
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

8 participants